Tuesday, October 29, 2024

Script Encoding and Cultural Identity: Navigating Digital Exclusion

By Maroua Bezzaoui, SILICON Intern

During the summer of 2024, Unicode’s internship program included interns from Stanford University, Northeastern University, and Google’s Summer of Code. Several of the interns have shared their experiences. The second featured piece is from Maroua Bezzaoui at Stanford University.

ICU 76 Released

Unicode® ICU 76 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR).

ICU 76 updates to Unicode 16 (blog), including new characters and scripts, emoji, collation & IDNA changes, and corresponding APIs and implementations. It also updates to CLDR 46 (beta blog) locale data with new locales, significant updates to existing locales, and various additions and corrections. For example, the CLDR and Unicode default sort orders are now very nearly the same.

Most of the java.time (Temporal) types can now be formatted directly using the existing ICU4J date/time formatting classes.

There are some new APIs to make ICU easier to use with modern C++ and Java patterns. Most of the C/C++ APIs added for this purpose are implemented as C++ header-only APIs and are usable on top of binary-stable C APIs, which is a first for ICU.

The Java and C++ technology preview implementations of the (also in tech preview) CLDR MessageFormat 2.0 specification have been updated to match recent changes.

ICU 76 and CLDR 46 are major releases, including a new version of Unicode and major locale data improvements.

For details, please see
https://unicode-org.github.io/icu/download/76.html.

Adopt a Character and Support Unicode’s Mission

Looking to give that special someone a special something?
Or maybe something to treat yourself?
🕉️💗🏎️🐨🔥🚀爱₿♜🍀

Adopt a character or emoji to give it the attention it deserves, while also supporting Unicode’s mission to ensure everyone can communicate in their own languages across all devices.

Each adoption includes a digital badge and certificate that you can proudly display!

Have fun and support a good cause

You can also donate funds or gift stock

As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Tuesday, September 10, 2024

Announcing The Unicode® Standard, Version 16.0

Version 16.0 of the Unicode Standard is now available. This is a major version update that includes new characters and code charts, new data files and annexes, an updated core specification, and updated annexes and synchronized standards.

This version adds 5,185 new characters, including 3,995 additional Egyptian Hieroglyph characters plus seven new scripts, seven new emoji characters, and over 700 symbols from legacy computing environments, for a total of 154,998 characters. See the delta code charts for details on all the new scripts and characters. For additional details regarding new emoji, see Emoji Recently Added, v16.0.
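The counts in this announcement can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and the two derived values (the repertoire size before this release and the remainder after hieroglyphs and emoji) are computed here rather than quoted from the standard.

```python
# Figures quoted in the Unicode 16.0 announcement.
total_16_0 = 154_998
new_in_16_0 = 5_185
new_hieroglyphs = 3_995
new_emoji = 7

# Derived: size of the repertoire before this release.
total_before = total_16_0 - new_in_16_0

# Derived: additions that are neither hieroglyphs nor emoji
# (characters of the seven new scripts, legacy computing symbols, and others).
other_additions = new_in_16_0 - new_hieroglyphs - new_emoji

print(total_before)     # 149813
print(other_additions)  # 1183
```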

In addition to new characters, new “Moji Jōhō Kiban” (文字情報基盤) Japanese source references have been added for over 36,000 CJK unified ideographs. This is reflected in the code charts for virtually all CJK unified ideograph blocks by additional representative glyphs in the “J” column.

The core specification for Version 16.0 is now available for browsing online as per-chapter web pages with “breadcrumb” and other links for easy navigation.

Two new annexes have been added to this version:

UAX #53, Unicode Arabic Mark Rendering: This annex, which was previously published as a Technical Report, specifies an algorithm for handling combining marks when rendering to ensure correct and consistent display of Arabic script text.
UAX #57, Unicode Egyptian Hieroglyph Database (Unikemet): This annex documents the format of the Unikemet.txt data file, which provides information clarifying the identity of Egyptian Hieroglyph characters and properties useful for implementations.

For complete details on Unicode Version 16.0, see https://www.unicode.org/versions/Unicode16.0.0/.

Tuesday, August 27, 2024

Highlights from Unicode Technical Meeting #180

by Peter Constable, UTC Chair

Unicode Technical Committee (UTC) meeting #180 was held July 23 – 25 in Redmond, Washington, hosted by Microsoft. Here are some highlights.

Finalizing Unicode 16.0

One priority was to finalize technical decisions for Unicode 16.0 in preparation for a September 10 release. Beta feedback and a small number of new proposals were considered and various decisions affecting 16.0 were taken. Regarding the set of encoded characters and emoji sequences for Unicode 16.0, no changes were made from the Beta.

Unicode 16.0 will include major additions and improvements for Egyptian Hieroglyphs, most of which were already included in the Beta. One aspect of the improvements is a refinement in the encoding model for rotational variants using variation sequences. Since the Beta, it was recognized that ten of the Egyptian Hieroglyphs encoded as characters in Unicode 5.2 would be better represented using rotational variation sequences. This led to some new UTC decisions affecting the 16.0 release:

Ten standardized variation sequences for Egyptian Hieroglyph rotational variants were added, while one standardized variation sequence that had been added in Unicode 15.0 was rescinded.
In the Unikemet.txt data file with Egyptian Hieroglyph properties, the kEH_Core property has been changed from a binary property to having an enumeration of values, one of which is “L(egacy)” indicating characters encoded in Unicode 5.2 that are not part of the core set and are not expected to be supported in fonts.
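For readers unfamiliar with the mechanism: a standardized variation sequence is a base character followed by a variation selector. As a rough sketch, the following parses lines in the style of the UCD file StandardizedVariants.txt and builds the corresponding strings. The sample entry is the long-standing one for DIGIT ZERO, not one of the new hieroglyph sequences, and the function name `parse_variation_sequences` is an assumption made for illustration.

```python
def parse_variation_sequences(text):
    """Yield (sequence, description) pairs from StandardizedVariants-style lines."""
    for line in text.splitlines():
        # Drop trailing comments and skip blank or comment-only lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        # First field: base code point and variation selector, in hex.
        code_points = [int(cp, 16) for cp in fields[0].split()]
        yield "".join(map(chr, code_points)), fields[1]


# Illustrative sample; a real run would read the full data file instead.
SAMPLE = """
# base selector; description; # character name
0030 FE00; short diagonal stroke form; # DIGIT ZERO
"""

for sequence, description in parse_variation_sequences(SAMPLE):
    print([f"U+{ord(ch):04X}" for ch in sequence], description)
```

The same loop would pick up the hieroglyph entries once pointed at the 16.0 data file, since it makes no assumption about which base characters appear.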

Another significant change affecting the 16.0 release is a glyph change for U+0620 ARABIC LETTER KASHMIRI YEH, and a change to its joining group in ArabicShaping.txt (180-C23, 180-C24). This affects not only the glyph shown in the code chart, but also the positional forms shown in the Arabic section of the core spec. The need for this arose from incorrect information in the core spec resulting in fonts that don’t provide a final form that matches users’ expectations. See L2/24-152 for background details.
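One way to spot a joining-group change like this between two releases is to compare the joining group field of ArabicShaping.txt for each code point. The sketch below assumes the file's semicolon-separated layout (code point; schematic name; joining type; joining group). The two sample snippets use placeholder group names (`GROUP_OLD`, `GROUP_NEW`) rather than the actual values from any release, and `joining_groups` and `changed_groups` are hypothetical helpers.

```python
def joining_groups(text):
    """Map code point -> joining group from ArabicShaping-style lines."""
    groups = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        code, _name, _joining_type, group = [f.strip() for f in line.split(";")]
        groups[int(code, 16)] = group
    return groups


def changed_groups(old_text, new_text):
    """Return {code point: (old group, new group)} for entries that differ."""
    old, new = joining_groups(old_text), joining_groups(new_text)
    return {
        cp: (old[cp], new[cp])
        for cp in old.keys() & new.keys()
        if old[cp] != new[cp]
    }


# Placeholder data: the group names below are NOT the real property values.
OLD = "0620; KASHMIRI YEH; D; GROUP_OLD\n0628; BEH; D; BEH\n"
NEW = "0620; KASHMIRI YEH; D; GROUP_NEW\n0628; BEH; D; BEH\n"

for cp, (before, after) in changed_groups(OLD, NEW).items():
    print(f"U+{cp:04X}: {before} -> {after}")
```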

While no further changes were made to the set of emoji in Unicode 16.0, a change will be made in how emoji characters are displayed in the code charts. The technology used to produce the chart pages is not able to display full-color emoji, and up to now the code charts have not made it clear when pictographic symbols have the Emoji property. In Unicode 16.0, characters with the Emoji property will be indicated in the code charts with a small triangular badge in the top left corner of the cell. A white triangle will indicate an emoji character that should have default emoji (full color) presentation. A black triangle will indicate an emoji character that should have default text (monochrome) presentation.
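The badge choice described above follows from two properties listed in emoji-data.txt: Emoji and Emoji_Presentation. Here is a minimal sketch, assuming that file's `code point(s) ; property` line format and a tiny hand-copied sample rather than the full data file; the `badge` helper and its return strings are illustrative and not part of any Unicode tooling.

```python
def parse_properties(text):
    """Map code point -> set of property names from emoji-data-style lines."""
    props = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        code_points, prop = [f.strip() for f in line.split(";")]
        # A field is either a single code point or a "start..end" range.
        start, _, end = code_points.partition("..")
        for cp in range(int(start, 16), int(end or start, 16) + 1):
            props.setdefault(cp, set()).add(prop)
    return props


# Small sample only; a real run would read the full emoji-data.txt.
SAMPLE = """
0023          ; Emoji                # number sign
231A..231B    ; Emoji                # watch..hourglass done
231A..231B    ; Emoji_Presentation   # watch..hourglass done
"""


def badge(cp, props):
    """Return the chart badge for a code point, or None if it is not an emoji."""
    found = props.get(cp, set())
    if "Emoji" not in found:
        return None
    return "white triangle" if "Emoji_Presentation" in found else "black triangle"


props = parse_properties(SAMPLE)
for cp in (0x0023, 0x231A, 0x0041):
    print(f"U+{cp:04X}", badge(cp, props))
```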

The script descriptions in the core spec are used to provide background information on each script as well as information to guide implementations. For many scripts, it has been a challenge to provide comprehensive guidance for implementations, particularly when there are complex rendering requirements. However, some implementers have written Unicode Technical Notes providing guidance for implementation of a particular script. Although these are not normative specifications approved by UTC, they can still be valuable information conducive to interoperable implementations. For Unicode 16.0, UTC decided to have two existing UTNs referenced within the core spec sections for the respective scripts:

As mentioned for the Beta, the core spec for Unicode 16.0 will be published as per-chapter HTML pages.

Characters for future versions

At UTC #180, code points were provisionally assigned for 1,063 new characters, including 38 Arabic characters, 45 characters for phonetic transcription, and 965 ideographs and radicals for Jurchen script. With these characters in the pipeline, work can get started on property data, charts, and other content that will be needed for them to be encoded in a future version of the standard.

Some initial decisions were also taken on the character additions for Unicode 17.0: as IRG had finalized its recommendations for CJK Unified Ideographs Extension J, that block of 4,300 new ideographs has been approved for encoding in Unicode 17.0.

Also, a proposal was approved to disunify one existing CJK unified ideograph character, U+5CC0, in Unicode 17.0. When U+5CC0 was encoded in Unicode 1.1, it was deemed that two similar ideographs should be unified. The proposal demonstrated that this unification should not have been made, and that was confirmed earlier this year by IRG. The changes for 17.0 will include encoding of a new character, U+2B73A, and revision of the source references for U+5CC0, U+2B73A, and U+2F879. A complication in this case is that ideographic variation sequences for the two distinct glyphs have been registered for use in Japan. No changes in “J” source references will be made, and it is not expected that implementations for Japanese will be affected. For additional information, see section 7 of L2/24-165.
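Of the three code points mentioned, U+2F879 sits in the CJK Compatibility Ideographs Supplement, and characters in that block carry a canonical decomposition to another ideograph. The sketch below uses Python's standard `unicodedata` module to print what the local Unicode data says about each code point. It assumes, without hard-coding it, that U+2F879 maps to U+5CC0; the `describe` helper is illustrative, and U+2B73A will be reported as unassigned (`Cn`) by any runtime whose data predates Unicode 17.0.

```python
import unicodedata


def describe(cp):
    """Summarize category and NFC form for one code point."""
    ch = chr(cp)
    nfc = unicodedata.normalize("NFC", ch)
    return {
        "code_point": f"U+{cp:04X}",
        "category": unicodedata.category(ch),
        "nfc": [f"U+{ord(c):04X}" for c in nfc],
    }


for cp in (0x5CC0, 0x2B73A, 0x2F879):
    print(describe(cp))
```

Because normalization replaces a compatibility ideograph with its decomposition target, any text pipeline that applies NFC will not preserve U+2F879 as a distinct character.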

Variation sequences and historic scripts

People working with historic scripts often deal with glyph variations. Variation sequences seem like an appropriate encoding mechanism to use in such cases, though asking UTC to standardize variation sequences for many historic variations could seem like a challenge. With that in mind, a proposal was presented to encode a block of additional "user-defined" variation selector characters. These would be additional PUA characters with a constraint that they would only be used as variation selectors.

That proposed solution is problematic: existing stability policies and commitments prevent assigning more PUA code points and also prevent constraining existing PUA for certain uses. At the same time, there was opinion within UTC that the need expressed was reasonable, and there was openness to considering alternative solutions. One potential alternative that gained some interest was to establish a registration process, similar to what is defined in UTS #37 for ideographic variation sequences but intended for use with historic scripts.

For complete details on outcomes from UTC #180, see the draft minutes.

Tuesday, May 21, 2024

Unicode 16.0 Beta Review Open

The beta review period for Unicode® 16.0 has started and is open until July 2, 2024.

The beta is intended primarily for review of character property data and changes to algorithm specifications (Unicode Standard Annexes). Also, for the first time, a complete draft of the core specification text is available for review during the beta period.

At this phase of a release, the character repertoire is considered stable. For this release, 5,185 new characters will be added, bringing the total number of encoded characters in Unicode 16.0 to 154,998. The new additions include seven new scripts:

Garay is a modern-use script from West Africa
Gurung Khema, Kirat Rai, Ol Onal, and Sunuwar are four modern-use scripts from Northeast India and Nepal
Todhri is an historic script used for Albanian
Tulu-Tigalari is an historic script from Southwest India

Other character additions include seven new emoji characters plus 3,995 additional Egyptian Hieroglyphs and over 700 symbols from legacy computing environments. See the delta code charts for details on all the new scripts and characters.

In addition to new characters, new “Moji Jōhō Kiban” (文字情報基盤) Japanese source references will be added for over 36,000 CJK unified ideographs. This will be reflected in the code charts for virtually all CJK unified ideograph blocks by additional representative glyphs in the “J” column. Note that these glyph additions are not reflected in the delta charts mentioned above, but can be seen in the main (“single-block”) charts for the Unicode 16.0 Beta.

Various changes to properties, algorithms, and Unicode Standard Annexes will be made for Unicode 16.0. This version will add two new Unicode Standard Annexes:

UAX #53, Unicode Arabic Mark Rendering, provides a specification for interoperable font and shaping implementations for Arabic script. (This was previously published separately from the Unicode Standard as a technical report.)
UAX #57, Unicode Egyptian Hieroglyph Database (Unikemet), provides data essential for understanding the identity of over 5,100 Egyptian Hieroglyph characters encoded in Unicode 16.0. (This is similar to data for CJK unified ideographs provided in UAX #38.)

A new UCD file, DoNotEmit.txt, will provide data in machine readable form that can be useful for software implementations but that previously was provided only as tables within the core specification text. See the Unicode 16.0 Beta landing page for other noteworthy property and algorithm changes.

For full details regarding the Beta, see Public Review Issue #502. Feedback should be reported under PRI #502 using the Unicode contact form by July 2, 2024.

Monday, November 13, 2023

UTC #177 Highlights

by Peter Constable, UTC Chair

Unicode Technical Committee (UTC) meeting #177 was held November 1 to 3 in Cupertino, California, hosted by Apple. Here are some highlights from the meeting.

Starting the Unicode 16.0 cycle

UTC approved a plan and timeline for the Unicode 16.0 release. Here’s a summary of the timeline:

January 2024: UTC #178 will finalize content for the alpha release
February – March: alpha release for public review
April: UTC #179 will finalize content for the beta release
May – June: beta release for public review
July: UTC #180 will finalize 16.0 content
September: Unicode 16.0 release

UTC is still adjusting to changes in how work for each release is managed. So, while this will be a “full” release, UTC will be conservative about taking on too many changes, particularly to algorithm specifications (UAXes, UTSes). Also, a new format for the core text will be used in this release: instead of PDF, it will be published using Web technologies (HTML, etc.). To get early validation on format changes, the alpha release will include a sampling of content from the core text.

Unicode 16.0 character and emoji repertoire

UTC had previously approved 1,179 characters for encoding in Unicode 16.0. At this UTC meeting, 15 additional characters were approved for version 16.0, including seven emoji characters. UTC has been planning to include nearly 4,000 additional Egyptian Hieroglyphs in Unicode 16.0. The proposal was discussed, and a small revision was requested. It’s expected these will be approved for Unicode 16.0 at the next UTC meeting. Apart from the additional hieroglyphs, we expect no further characters will be added to the Unicode 16.0 repertoire.

Besides the characters approved for Unicode 16.0, code points were provisionally assigned for 184 new characters that are candidates for encoding in a future Unicode version.

See the Pipeline page for all characters currently approved for Unicode 16.0, along with code points provisionally assigned for future encoding.

Future of UAX #42, UCD in XML

UAX #42, Unicode Character Database in XML (UCDXML), was originally developed by Eric Muller. He and Laurentiu Iancu maintained UCDXML through many versions, and we’re very grateful for this contribution. Eric and Laurentiu are no longer available to maintain this, however, and no others have volunteered to take over maintenance. After discussion over several months in UTC and in the Properties and Algorithms working group, UTC has concluded the best option for the future of UAX #42 is to stabilize it, with data frozen at Unicode 15.1. A Public Review Issue will be posted to get feedback on this plan.

Future maintenance of UCS repertoire

UTC discussed a proposal for ISO/IEC JTC 1/SC 2 to adopt a different process for future maintenance of the repertoire of ISO/IEC 10646, using a maintenance agency rather than the process that is used for developing entirely new standards, as done in the past. It was felt that this would be more agile and would align better to how expert input has guided actual encoding decisions for several years now. This proposal will be formally submitted to JTC 1/SC 2 as a proposal from the US national standards body.

Full details on these and other outcomes are provided in the minutes—see L2/23-231.

Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Wednesday, November 1, 2023

What do a leafless tree, a fingerprint, and a harp have in common?

This is not a setup to a riddle. This is Emoji 16.0.

By Jennifer Daniel, Chair of the ESC

This week, the Unicode Technical Committee gathered for our last meeting of 2023 to discuss the encoding, data files, and list of characters related to digitizing the world’s languages. Amongst the topics discussed were emoji, and as a result seven new characters are on their way for inclusion into the Unicode Standard, into your keyboards, and into your hearts ;-)

The final recommendations culminated in seven emoji: one emoji per major category.

An incredibly powerful aspect of written language is that it consists of a finite number of characters that can “do it all”. And yet, as the emoji ecosystem has matured over time our keyboards have ballooned and emoji categories are about to hit — or have hit — a level of saturation. Upon reflecting on how emoji are used, the Unicode Emoji Subcommittee (ESC) has entered a new era where the primary way for emoji to move forward is not merely to add more of them to the Unicode Standard, but to consider how the ones added provide the most linguistic flexibility. As a result, the ESC approves fewer and fewer emoji proposals every year.

The few that are added this year have demonstrated their adaptability in different contexts. Take, for example, fingerprint. It is commonly used to represent multiple concepts. Fingerprints are a symbol of identity (unique as you), security (as a passkey), and forensics (what crime show logo is complete without a fingerprint?). While we think of fingerprints as a relatively modern phenomenon, according to Forensics Digest the earliest use of fingerprints dates back to 1000 B.C.

In fact, all of this year’s emoji candidates have deep roots in history. Harps have been known since antiquity in Asia, Africa, and Europe, dating back at least as early as 3000 BCE. Today the harp has political, sporting, corporate, and religious symbolism 👼 Leafless trees have been around as long as ... well, trees (and poetry!) I suppose. Leafless trees literally represent droughts or winter and metaphorically indicate a state of barrenness and death.

Shovel isn’t just another noun — sure, yes, it’s a tool commonly found in your shed — in our keyboards, however, it’s also a verb. Digging yourself out of a hole, digging yourself into a hole, shoveling 💩, it does it all. But wait, there’s more. Splatter is one of those stealth emoji that when you look at you might be thinking, “really, another sex emoji?” (To be honest, show me someone who doesn’t think an emoji is a sex emoji and I’ll show you someone who lacks imagination). Splatter is a spill. Splatter is expressive. Splatter is soft — a perfect counterpoint to collision 💥 — the bouba to 💥’s kiki.

When can you get these new emoji?

A simple question that deserves a simple answer. Alas, you’re dealing with Unicode so the answer is complex. Did you know it can take up to two years to encode an emoji? It’s true. If we want the symbols we digitize to truly “just work” across the entirety of not just the Internet but all digital surfaces … it takes time. So, don’t expect to see these characters anytime soon. In fact, although the previous batch of emoji (phoenix, lime, broken chain, etc.) was approved last year, those emoji still haven’t landed on your device of choice, but they are well on their way to popping up in the first half of 2024.

Emoji 16.0 has a long road ahead and will appear on most devices in May-June 2025.
